In this notebook, we will work with the following:
By convention, imports go at the top of a Python script or notebook (see PEP 8). In relevant part:
Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants.
Imports should be grouped in the following order:
- Standard library imports.
- Related third party imports.
- Local application/library specific imports.
You should put a blank line between each group of imports.
# standard library
import sys
import time
# third party
import numpy as np
import pandas as pd
import plotly.express as px
from textblob import TextBlob
pd.set_option("mode.copy_on_write", True)
Note a few things in the block above.
- `import sys` is the simplest version.
- pandas is often imported as `pd`, both because it is used often and also by convention (see pandas documentation).
- We can also import a specific object from a package, like the class `TextBlob` from the package textblob.
- Lines starting with `#` are comments. Those lines are not executed by Python, and they are useful for us to make notes about what we are doing.

Let's see how these work in action.
print(sys.executable)
/usr/local/bin/python
Note that to reach the attribute executable inside the sys module, we have to go through the module's namespace: sys.executable. More about namespaces below.
Somewhat abstractly, the Python docs define a namespace as follows.
A namespace is a mapping from names to objects.
For our purposes, we can think of them as paths to get to tools of interest. This topic goes (much) deeper, but a more instrumental understanding is fine for our use.
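To make the "mapping from names to objects" idea concrete, here is a minimal sketch that uses only the standard library. The built-in `vars()` returns a module's namespace as a dictionary, so a dictionary lookup and dot notation should reach the same object. The variable names are just illustrative.

```python
import sys

# vars() returns the module's namespace: a dict mapping names to objects.
sys_namespace = vars(sys)

# Dot notation and a dictionary lookup reach the same object.
same_object = sys_namespace["executable"] is sys.executable
print(same_object)

# getattr() is another way to follow the same path, using the name as a string.
print(getattr(sys, "executable") == sys.executable)
```

Thinking of `sys.executable` as "look up the name executable in the namespace of sys" is usually enough for our purposes.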
If we want to know what is contained in a namespace, we can easily find out with the dir() built-in function.
dir(sys)
['__breakpointhook__', '__displayhook__', '__doc__', '__excepthook__', '__interactivehook__', '__loader__', '__name__', '__package__', '__spec__', '__stderr__', '__stdin__', '__stdout__', '__unraisablehook__', '_base_executable', '_clear_type_cache', '_current_exceptions', '_current_frames', '_debugmallocstats', '_framework', '_getframe', '_getquickenedcount', '_git', '_home', '_stdlib_dir', '_xoptions', 'abiflags', 'addaudithook', 'api_version', 'argv', 'audit', 'base_exec_prefix', 'base_prefix', 'breakpointhook', 'builtin_module_names', 'byteorder', 'call_tracing', 'copyright', 'displayhook', 'dont_write_bytecode', 'exc_info', 'excepthook', 'exception', 'exec_prefix', 'executable', 'exit', 'flags', 'float_info', 'float_repr_style', 'get_asyncgen_hooks', 'get_coroutine_origin_tracking_depth', 'get_int_max_str_digits', 'getallocatedblocks', 'getdefaultencoding', 'getdlopenflags', 'getfilesystemencodeerrors', 'getfilesystemencoding', 'getprofile', 'getrecursionlimit', 'getrefcount', 'getsizeof', 'getswitchinterval', 'gettrace', 'hash_info', 'hexversion', 'implementation', 'int_info', 'intern', 'is_finalizing', 'last_traceback', 'last_type', 'last_value', 'maxsize', 'maxunicode', 'meta_path', 'modules', 'orig_argv', 'path', 'path_hooks', 'path_importer_cache', 'platform', 'platlibdir', 'prefix', 'ps1', 'ps2', 'ps3', 'pycache_prefix', 'set_asyncgen_hooks', 'set_coroutine_origin_tracking_depth', 'set_int_max_str_digits', 'setdlopenflags', 'setprofile', 'setrecursionlimit', 'setswitchinterval', 'settrace', 'stderr', 'stdin', 'stdlib_module_names', 'stdout', 'thread_info', 'unraisablehook', 'version', 'version_info', 'warnoptions']
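The full listing is long, and many of the names start with an underscore, which by convention marks them as internal. A small sketch of one way to narrow it down follows; the name `public_names` is just a label chosen for this example.

```python
import sys

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(sys) if not name.startswith("_")]

print(len(dir(sys)), "names in total")
print(len(public_names), "public names")
print(public_names[:5])
```

The exact counts depend on the Python version, so they will likely differ from one machine to the next.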
While we might think of namespaces as synonymous with packages, the concept is more general than that.
Individual objects have their own namespaces, like the TextBlob class we imported earlier.
dir(TextBlob)
['__add__', '__class__', '__contains__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_cmpkey', '_compare', '_create_sentence_objects', '_strkey', 'analyzer', 'classify', 'correct', 'detect_language', 'ends_with', 'endswith', 'find', 'format', 'index', 'join', 'json', 'lower', 'ngrams', 'noun_phrases', 'np_counts', 'np_extractor', 'parse', 'parser', 'polarity', 'pos_tagger', 'pos_tags', 'raw_sentences', 'replace', 'rfind', 'rindex', 'sentences', 'sentiment', 'sentiment_assessments', 'serialized', 'split', 'starts_with', 'startswith', 'strip', 'subjectivity', 'tags', 'title', 'to_json', 'tokenize', 'tokenizer', 'tokens', 'translate', 'translator', 'upper', 'word_counts', 'words']
We can also look at what is in the global namespace.
dir()
['In', 'Out', 'TextBlob', '_', '_5', '_6', '__', '___', '__builtin__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', '__vsc_ipynb_file__', '_dh', '_i', '_i1', '_i2', '_i3', '_i4', '_i5', '_i6', '_i7', '_ih', '_ii', '_iii', '_oh', 'exit', 'get_ipython', 'np', 'open', 'pd', 'px', 'quit', 'sys', 'time']
Note that we see sys and pd (our alias for pandas), which we imported, and we also see the TextBlob class that we imported from its package.
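The different import forms only change which name lands in our namespace, not which object it points to. The sketch below illustrates this with the standard library's json module, used here as a stand-in (an assumption for illustration) so that it runs without pandas or textblob installed.

```python
import sys
import json as js        # alias: binds the name "js" in our namespace
from json import dumps   # binds only the name "dumps"

# The alias points at the very same module object that Python has cached.
alias_is_module = js is sys.modules["json"]
print(alias_is_module)

# A name imported with "from ... import ..." is the same object as the
# attribute reached through the module namespace.
direct_is_attribute = dumps is js.dumps
print(direct_is_attribute)
```

This is why `pd.DataFrame` and `pandas.DataFrame` refer to the same class: `pd` is just a shorter name for the same module.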
A fun import is The Zen of Python philosophy, which can be accessed by importing this.
Note that I'm slightly breaking the rules above for the purposes of illustration.
import this # noqa: E402, F401
The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
As we'll talk about later, Jupyter notebooks are a great interface for working with Python (or R or a number of other kernels). However, they are not the only game in town.
The Jupyter lab interface and notebooks provide a number of conveniences for research.
Jupyter uses the simple markdown syntax for formatting text. There are some extensions and differences from original markdown, so you may find the Jupyter Notebooks docs to be a better reference.
- **Bold**.
- *Italicise*.
- ***Do both***.
- Headings use `#`.

# First heading
## Second heading
### Third heading
- Bullets start with a dash and a space.

Numbered lists start with a number, period, and a space:
1. First
1. Second
1. Third

Note that they all start with 1., and markdown handles numbering for us.
We could, of course, number them ourselves.
We can also nest lists and mix list types by indenting:
- Bullet
    1. Nested list item
    1. Another one
- Another bullet
    1. More lists
        - More bullets
We can reference code in two ways.
First, we can use inline code like import this by using backticks ` to enclose the code: `import this`.
Second, we can make code blocks by using beginning and ending lines with three backticks: ```.
Do note that I'm having to be tricky to display backticks inside of code.
def f_to_c(temp_f):
    return (temp_f - 32) * 5 / 9
We can make it a little nicer (with syntax highlighting) by adding the code type to the first line: ```python.
def f_to_c(temp_f):
    return (temp_f - 32) * 5 / 9
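The function in the markdown example is real Python, so it can also be run in a code cell. A minimal sketch with two well-known reference temperatures as a sanity check; the docstring is an addition for illustration.

```python
def f_to_c(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


# Water freezes at 32 °F and boils at 212 °F.
print(f_to_c(32))   # 0.0
print(f_to_c(212))  # 100.0
```

Note that the body of the function must be indented. Python uses indentation, not braces, to mark what belongs to the function.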
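We can add links, like one to my GitHub page, using the text in brackets followed by the link in parentheses: [github page](https://github.com/jtkiley).
We can add images by using similar syntax with a leading exclamation mark to point to an image: `![alt text](path/to/image.png)` (the alt text and path here are placeholders).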
Similar to code, we can also use math and equations inline and in blocks.
For inline math, like the union of a set $S \cup T = \{x \mid x \in S \vee x \in T\}$, we can use a single dollar sign to denote math: $S \cup T = \{x \mid x \in S \vee x \in T\}$.
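As an aside, the union in that formula maps directly onto Python's built-in set type. A small sketch follows, with example sets and a candidate range chosen arbitrarily for illustration.

```python
S = {1, 2, 3}
T = {3, 4}

# Built-in union operator.
union = S | T

# The same set built from the definition: x is in S or x is in T.
# range(10) is an arbitrary pool of candidates for this example.
from_definition = {x for x in range(10) if x in S or x in T}

print(union == from_definition)
```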
We can also use blocks by using beginning and ending lines with two dollar signs: $$.
$$
\operatorname{MSE}=\frac{1}{n}\sum_{i=1}^n(Y_i-\hat{Y_i})^2
$$
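To connect the notation to code, here is a minimal pure-Python sketch of the mean squared error formula above. The function name, the argument names, and the example numbers are assumptions for illustration; in practice we would more likely use numpy or a library function.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n


observed = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

# Squared errors are 1, 0, 4, and 1, so the mean is 6 / 4.
print(mean_squared_error(observed, predicted))  # 1.5
```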
There are many math features, including matrices:
$$
A = \begin{pmatrix}
\underbrace{\begin{matrix} a_{0,0} \\ a_{1,0} \\ \vdots \\ a_{m-1,0} \end{matrix}}_{a_0} &
\underbrace{\begin{matrix} a_{0,1} \\ a_{1,1} \\ \vdots \\ a_{m-1,1} \end{matrix}}_{a_1} &
\begin{matrix} \dots \\ \dots \\ \ddots \\ \dots \end{matrix} &
\underbrace{\begin{matrix} a_{0,n-1} \\ a_{1,n-1} \\ \vdots \\ a_{m-1,n-1} \end{matrix}}_{a_{n-1}} \\
\end{pmatrix}
$$
We can also display graphics that are output from our work with data.
# Create some random data
data1 = pd.DataFrame(np.random.rand(200, 4), columns=list("ABCD"))
# Display the top of the dataframe
data1.head()
| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.241023 | 0.762872 | 0.763842 | 0.557562 |
| 1 | 0.787578 | 0.692489 | 0.639784 | 0.742105 |
| 2 | 0.731753 | 0.116221 | 0.260077 | 0.063716 |
| 3 | 0.981630 | 0.203194 | 0.927129 | 0.178575 |
| 4 | 0.104788 | 0.826646 | 0.336700 | 0.836136 |
# Make a histogram of column A
px.histogram(data1, x="A").show()
# Keep the figure object; .show() returns None, so assign before calling it
fig2 = px.scatter_matrix(data1)
fig2.show()
px.scatter_3d(data1, x="A", y="B", z="C", color="D").show()
For many examples of really cool visualizations that are easy to do (and have code samples), see the Plotly Express documentation.
There are many magic commands, both line magics (one %) and cell magics (two %%), that provide convenience features.
If you find yourself getting errors for a file not being found, it may help to know where the working directory is.
You can use the %pwd magic.
%pwd
'/workspaces/carma_python/notebooks'
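The %pwd magic only works inside IPython and Jupyter. In plain Python, a comparable check can be sketched with the standard library's pathlib module. The file name below is hypothetical and used only to show how a relative path resolves.

```python
from pathlib import Path

# The current working directory: relative paths are resolved against it.
cwd = Path.cwd()
print(cwd)

# Build the absolute location Python would look in for a relative path.
# "data/example.csv" is a hypothetical file name used only for illustration.
candidate = cwd / "data" / "example.csv"
print(candidate, "exists:", candidate.exists())
```

If a file is reported as not found, printing the resolved path like this often shows that the notebook is running from a different folder than expected.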
A really common issue with large text datasets is that some things take a long time to run.
To know how long that is, we can use the %%time magic to get the time a cell takes to run.
Do note how we're using two percent signs: %%.
That makes the magic apply to the cell, instead of just the rest of the line.
%%time
# Use time.sleep() to make this cell take some time.
time.sleep(2)
print("Done!")
Done!
CPU times: user 1.34 ms, sys: 260 µs, total: 1.6 ms
Wall time: 2 s
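Like other magics, %%time is only available in IPython and Jupyter. In a regular .py script, a rough equivalent of the wall time measurement can be sketched with `time.perf_counter()` from the standard library. The short sleep is a stand-in for slow work.

```python
import time

start = time.perf_counter()
time.sleep(0.2)  # stand-in for a slow step
elapsed = time.perf_counter() - start

print(f"Elapsed wall time: {elapsed:.2f} s")
```

This measures wall time only. The CPU time that %%time reports is much smaller in the output above because sleeping does not use the processor.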
We have a few options to share our notebooks with others.
- We can send the notebook file itself (.ipynb). Often, we will also need to send an environment file (environment.yml) and any data files that we rely on. Since data files may be large, you will often want to use a service like Dropbox to send them.

There are a few challenges when using Jupyter.
.py files and using Jupyter for prototyping. Otherwise, this may not be a big concern.

Overall, Jupyter notebooks are a great tool. Once you have some experience using them, you may find them fairly natural to work with.
Let's do a few exercises to reinforce the concepts we learned above.
We saw above how to import a package and inspect the namespace of it.
Later in the course, we will be using the pynytimes package.
Let's use it for an example here.
pynytimes package.
# 1-1 code
# 1-2 code
Remember that we can use markdown to have rich text features. Let's try it.
https://github.com/michadenheijer/pynytimes.
Many coauthors will be unfamiliar with using Jupyter notebooks, and it may not be a good time investment to have them set it up and learn how it works, only to review your work. However, if they can read it, a lot of the code will make sense. An easy way to share it is to export an HTML file that they can view in a web browser.